Predicting gene variant pathogenicity

ABSTRACT

A computer-implemented, gene-specific prediction tool for classifying and interpreting gene tests is described. The prediction tool includes a predictor using a consensus framework. The predictor employs a weighted metric of existing and complementary prediction algorithms and calculated reference intervals of disease outcomes to calculate a consensus score used in interpreting gene tests.

RELATED APPLICATIONS

The present application claims the benefit of priority under 35 U.S.C. §119 from U.S. Provisional Patent Application Ser. No. 61/518,833 entitled “Decision Support for Uncertain Gene Variants,” filed on May 12, 2011, which is hereby incorporated by reference in its entirety for all purposes.

BACKGROUND

Medical genetics involves diagnosis, management, and determination of risk of hereditary disorders. Understanding the genotype-phenotype correlation of gene variants in disease is a major component of medical genetics. In monogenic diseases, gene mutations are typically curated as either “pathogenic” or “benign.” However, many gene variants (i.e., gene mutations) are classified as being “unknown” or “uncertain” because they cannot be clearly associated with a clinical phenotype. Accurate interpretation of gene testing, including accurate phenotype association of gene variants, is an important component in customization of healthcare such that decisions and practices provided to a patient are tailored to the individual patient.

In recent years, various efforts, such as the Human Variome Project, 1000 Genomes, and NCBI Genetic Testing Registry, have resulted in a growing interest in annotation and clinical interpretation of gene variants in human diseases. Further, with rapidly evolving technologies (e.g., Single Nucleotide Polymorphisms (SNP) chip genome wide association studies and next-generation sequencing), genomic analysis has become faster and more cost effective, yielding much larger data sets than previously available. However, there exists a gap between the rapidly growing collections of genetic variation (i.e., genetic mutation) and practical clinical implementation. Further, as genetic information is incorporated into the electronic medical record, new decision support approaches are needed to provide clinicians with a preferred course of treatment. Moreover, for decision support rules to add value, the clinical relevance of laboratory information should be well understood.

Gene variant classification is critical in informing clinicians of the most appropriate course of treatment. To that end, medical geneticists typically rely on patient history and family segregation, literature review and trusted colleagues to stay informed of the phenotype consequences of a given gene variant. Although computer-based prediction methods may be employed to classify gene variants, there still exists a lack of a widely accepted standard computational predictor of mutation severity for novel or uncertain gene variants in clinical use. Further, existing prediction methods, despite being actively used in laboratories, do not offer sufficient accuracy to predict disease phenotype to the degree necessary to be clinically applicable.

In recent years, updated recommendations on reporting and classification of gene variants, including approaches targeted at determining the clinical significance of variants of uncertain significance, have been proposed by the American College of Medical Genetics (ACMG). Further, in order to improve interpretation of unclassified genetic variants, definitions and terminology have also been recommended by the International Agency for Research on Cancer (IARC).

Despite these recommendations, terms such as “deleterious,” “mutation,” “pathogenic,” or “causative of disease” are still being used in reporting genetic tests. Further, test results such as “indeterminate,” “unknown,” “uncertain,” “unclassified,” or “undetermined” render interpretation of the significance of a gene test result difficult. Further compounding this issue, word modifiers such as “likely,” “suspected,” “predicted,” “mild,” “moderate,” or “severe” often are used to accompany variant classification.

The lack of a quantitative metric or a standardized scale for evaluation of novel or uncertain gene variants renders test result interpretation difficult and dependent on the location and expertise at hand. A second and closely related challenge is the lack of an objective and standardized framework or context to make that metric meaningful. The quantitative metric and framework for evaluation become especially critical for interpretation of novel and uncertain gene variants where there is the obvious lack of traditional or existing evidence such as family history, pedigree trios or sib pairs, confirming literature reports, bench assay biochemical evidence, or colleague consensus of disease association.

SUMMARY

Certain embodiments of the present invention relate to a gene-specific prediction tool for classifying and interpreting a gene test. The prediction tool may include a predictor implemented using a consensus framework. In certain embodiments, the consensus framework may include a weighted metric of existing and complementary prediction algorithms and calculated reference intervals of known disease outcomes.

In certain embodiments, a plurality of classifiers for interpreting a genetic test is selected from among a selection of classifiers. Each classifier may be trained using data relating to gene variants and their known phenotypes. An interpretation of the genetic test may be obtained from each selected classifier. A correlation matrix including a numerical correlation for each pair of classifiers may be determined. The numerical correlation may indicate a correlation between the pair of classifiers. Factor analysis of the correlation matrix may be performed to determine whether the interpretation obtained from a classifier is statistically independent from the interpretations of the remaining classifiers. In an event the classifier is not statistically independent from the interpretations of the remaining classifiers, a consensus score for each classifier may be obtained. The consensus score may be obtained as a function of a reduced matrix obtained from the factor analysis of the correlation matrix. An overall consensus score may be calculated using the consensus scores determined for each classifier. The overall consensus score is compared against a predetermined range. In an event the overall consensus score falls within the predetermined range, the genetic test interpretation associated with the predetermined range is reported as an outcome of the genetic test.

In certain embodiments, the consensus score for each classifier may be obtained as a function of a linear sum of the interpretations obtained from the classifier in an event the interpretations of all classifiers are statistically independent from one another.

In some embodiments, the consensus score for each classifier may be obtained as a function of a scalar product of the reduced matrix and the interpretation obtained from the classifier.

In some embodiments, performing factor analysis of the correlation matrix includes performing at least one of regression modeling, common factor analysis, or principal component analysis of the correlation matrix.
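By way of non-limiting illustration, the following Python sketch shows one way the principal component analysis and scalar product described above might be realized. It extracts the first principal component of a small classifier correlation matrix by power iteration and uses the resulting loadings as weights. The function names, the example correlation values, the example classifier outputs, and the choice to treat the normalized loadings as the reduced matrix are all assumptions made for this sketch and are not taken from the specification.

```python
import math


def first_principal_component(corr, iterations=500, tol=1e-12):
    """Return (eigenvalue, unit eigenvector) for the dominant component of a
    symmetric correlation matrix, found by power iteration."""
    n = len(corr)
    # Non-uniform start vector, so it is unlikely to be orthogonal to the answer.
    vec = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(x * x for x in product))
        if eigenvalue == 0.0:
            raise ValueError("power iteration collapsed to the zero vector")
        new_vec = [x / eigenvalue for x in product]
        shift = max(abs(a - b) for a, b in zip(new_vec, vec))
        vec = new_vec
        if shift < tol:
            break
    # Fix the sign so that the largest loading is positive.
    if max(vec, key=abs) < 0:
        vec = [-x for x in vec]
    return eigenvalue, vec


def loading_weights(loadings):
    """Scale loadings so their absolute values sum to 1 (an assumed convention)."""
    total = sum(abs(x) for x in loadings)
    return [x / total for x in loadings]


def consensus_score(weights, classifier_outputs):
    """Scalar product of the weights and the per-classifier numeric outputs."""
    if len(weights) != len(classifier_outputs):
        raise ValueError("one weight is needed per classifier output")
    return sum(w * x for w, x in zip(weights, classifier_outputs))


# Hypothetical pairwise correlations among three classifiers.
corr = [[1.0, 0.8, 0.6],
        [0.8, 1.0, 0.7],
        [0.6, 0.7, 1.0]]
eigenvalue, loadings = first_principal_component(corr)
weights = loading_weights(loadings)
# For a correlation matrix the trace equals the number of classifiers, so
# eigenvalue / n is the share of total variance on the first component.
print(round(eigenvalue / len(corr), 3))
print(consensus_score(weights, [0.91, 0.85, 0.40]))
```

Power iteration is used here only to keep the sketch free of external dependencies; a numerical library's eigenvalue routine would serve the same purpose.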

In some embodiments, a database including the data relating to gene variants and their known phenotypes may be accessed and each classifier may be trained using the data obtained from the database. In some embodiments, the data relating to gene variants and their known phenotypes may include Rearranged During Transformation (RET) proto-oncogene data.

In certain embodiments, the predetermined range may be determined using a first numerical consensus score reference interval determined for known benign phenotypes associated with gene variants and a second numerical consensus score reference interval determined for known pathogenic phenotypes associated with gene variants. In some embodiments, the overall consensus score may be compared against the first and second numerical consensus score reference intervals and a benign gene test outcome may be reported in an event the consensus score falls within the first numerical interval or a pathogenic gene test outcome may be reported in an event the consensus score falls within the second numerical interval.
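By way of non-limiting illustration, the comparison of an overall consensus score against the two reference intervals might be sketched in Python as follows. The function name and the interval bounds are hypothetical. Returning an uncertain outcome for a score outside both intervals follows the uncertain-outcome embodiment described elsewhere in this summary, and the overlap check is an added assumption.

```python
def interpret_consensus(score, benign_interval, pathogenic_interval):
    """Map an overall consensus score to a reported gene test outcome."""
    benign_low, benign_high = benign_interval
    pathogenic_low, pathogenic_high = pathogenic_interval
    # Assumption: the two reference intervals must not overlap.
    if benign_low <= pathogenic_high and pathogenic_low <= benign_high:
        raise ValueError("reference intervals overlap")
    if benign_low <= score <= benign_high:
        return "benign"
    if pathogenic_low <= score <= pathogenic_high:
        return "pathogenic"
    # A score outside both intervals is reported as uncertain.
    return "uncertain"


# Hypothetical reference intervals on a 0-to-1 consensus scale.
BENIGN = (0.0, 0.3)
PATHOGENIC = (0.7, 1.0)
for value in (0.12, 0.55, 0.93):
    print(value, interpret_consensus(value, BENIGN, PATHOGENIC))
```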

In some embodiments, a numerical representation of the interpretation obtained from each classifier may be obtained. The numerical representation may include at least one of mean, median, standard deviation, minimum, and maximum of numerical values output from the classifier. In certain embodiments, the correlation matrix may be obtained as a function of the numerical representations of the plurality of the classifiers. In some embodiments, the numerical correlation between each pair of classifiers may be obtained by determining a Spearman's rank correlation coefficient for the pair of classifiers.
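By way of non-limiting illustration, a Spearman's rank correlation coefficient for each pair of classifiers, and the resulting correlation matrix, might be computed as in the following Python sketch. Ties receive average ranks and the coefficient is the Pearson correlation of the ranks. The classifier labels and score values are hypothetical placeholders, not data from the specification.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(scores_by_classifier):
    """Return (names, matrix) of pairwise Spearman coefficients."""
    names = list(scores_by_classifier)
    matrix = [[1.0 if a == b else
               spearman(scores_by_classifier[a], scores_by_classifier[b])
               for b in names] for a in names]
    return names, matrix


# Hypothetical scores from three classifiers on the same five variants.
scores = {
    "classifier_a": [0.10, 0.40, 0.35, 0.80, 0.95],
    "classifier_b": [0.20, 0.30, 0.50, 0.70, 0.90],
    "classifier_c": [0.90, 0.60, 0.70, 0.20, 0.10],
}
names, matrix = correlation_matrix(scores)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```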

In some embodiments, the genetic test interpretation may be reported by graphically displaying the overall consensus score on the predetermined range. In certain embodiments, the consensus scores of the plurality of classifiers may be displayed on a radial plot.

In some embodiments, in an event the overall consensus score falls outside of the predetermined range, an uncertain outcome for the genetic test may be reported. In some embodiments, the overall consensus score for the uncertain results may be displayed over a predetermined range. In certain embodiments, the consensus scores of the plurality of classifiers may be displayed on a radial plot. Visualization of the consensus output may be used to augment available clinical information and assist in improving prediction algorithms as gene variant knowledge increases.

The advantages and novel features are set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of the methodologies, instrumentalities and combinations described herein.

It is understood that other configurations of the subject technology will become readily apparent to those skilled in the art from the following detailed description, wherein various configurations of the subject technology are shown and described by way of illustration. As will be realized, the subject technology is capable of other and different configurations and its several details are capable of modification in various other respects, all without departing from the scope of the subject technology.

Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawing figures depict one or more implementations in accord with the present teachings, by way of example only, not by way of limitation. In the figures, like reference numerals refer to the same or similar elements.

FIG. 1 is a high-level block diagram of an embodiment of the present invention for interpreting a gene test.

FIG. 2 illustrates RET protein domains and their reported disease causing variants as associated with different MEN2 phenotypes.

FIG. 3 is a flow diagram of the procedures for interpreting gene test results using a PSAAP classifier.

FIG. 4 is a table that summarizes performance of various classifiers in interpreting gene test results using a dataset of RET gene variant-disease data.

FIG. 5 is a high-level block diagram of procedures for determining a consensus score of multiple classifiers according to certain embodiments disclosed herein.

FIGS. 6A-6B illustrate the use of radar plots for consensus scoring.

FIGS. 7A-7B include examples of a comprehensive display for consensus scoring.

DETAILED DESCRIPTION

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, it will be apparent to those skilled in the art that the subject technology may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology. Like components are labeled with identical element numbers for ease of understanding.

FIG. 1 is a high-level block diagram 100 of an embodiment of the present invention for interpreting a gene test. A user 102 of a computing device 101 may access a database 150 to obtain gene variant-disease data 160 from the database 150. The term “gene variant” refers to a specific form of alteration/variation in the normal sequence of a gene. Genetic variations among individuals may occur on different scales, ranging from variations in the number and appearance of the chromosomes to changes in a single nucleotide. Although the significance of a gene variant or a gene variation is often unclear, in some cases, available studies of genotypes and their corresponding phenotypes may be used to determine the significance of a gene variant. The database 150 may store, collect, and/or display information regarding gene variants and their possible disease association (i.e., gene variant-disease data).

The database 150 may be local or remote to the computing device. Although not shown in FIG. 1, in certain embodiments, the user 102 may access more than one database of gene variant-disease data, each of which may be local or remote to the computing device. In some embodiments, the computing device user 102 may access a remote database 150 via a network (e.g., band limited communications network). In some embodiments, the computing device user may submit a request for the gene variant-disease data 160 to the database 150 and receive the data 160 from the database in response to that request.

The computing device 101 may include a machine-implemented prediction tool 105 which may be used to interpret a gene test. The prediction tool 105 may include a consensus classifier 110 that is responsible for performing the procedures required for classifying and interpreting the gene test. The details of the consensus classifier 110 are described later with reference to FIGS. 5-7.

In some embodiments, the interpretation results may be reported to the computing device user 102 on the display 103 of the computing device 101. In certain embodiments, in addition to or in place of displaying the gene test interpretation results, the computing device 101 may employ other reporting schemes known in the art to report the gene test interpretation results.

The term “gene test,” as used herein, refers to a test involving examination of deoxyribonucleic acid (DNA) molecules in search for genetic disorders, identifying individuals carrying a copy of a gene that may be responsible for a disease (carrier screening), diagnostic testing, pre-symptomatic testing, etc.

In some embodiments, the database 150 may store information on single nucleotide variants and their disease associations. Generally, a “single nucleotide variant” (SNV) or a “single nucleotide polymorphism” (SNP) refers to variations occurring when a single nucleotide in the genome differs between paired chromosomes of an individual or members of a biological species. Further, the term “non-synonymous single nucleotide polymorphism” (nsSNP) may be used to refer to a point mutation or a change in amino acid sequence as compared to a wild type or reference sequence.

Certain nsSNP variants have been shown to be causative of disease. Therefore, investigating the functional effect of SNPs has been of interest for many years. Due to the cost, labor, and expertise required for wet-bench molecular evaluation, computational tools have been used to assist in investigating the functional effect of SNPs. These computational tools often focus on SNP variants in protein coding regions that change one amino acid for another. The severity of a given amino acid sequence change may range from mild to severe, and has been reported to impact various medical areas, including genetic disease susceptibility (e.g., sickle cell anemia), common disease risks (e.g., Alzheimer's disease risk), or drug sensitivities, as seen in Warfarin treatment. Historically, physical and chemical properties of amino acids have been used as a proxy to assess the functional impact of these substitution mutations.

Some early efforts in predicting amino acid substitution effects focused on metrics for estimating the expected evolutionary distance between each possible amino acid pair. For example, Point Accepted Mutation (PAM) matrices have been used to approximate the evolutionary distance and frequency of amino acids for equivalent protein positions in closely related species. Each PAM matrix includes the standard amino acids in corresponding rows and columns, such that the value in a given cell represents the probability of having one amino acid substituted for another. Such matrices are commonly referred to as “substitution” matrices.

A substitution matrix may be used to derive a scoring matrix that may be used to assess the similarities between two aligned sequences. The Blocks of Amino Acid Substitution Matrix (BLOSUM) is an example of a substitution matrix that may be used for sequence alignment of proteins. BLOSUM considers highly conserved protein regions and may be used for more distantly related species. Both PAM and BLOSUM employ raw mutation rates to compute a score for each amino acid substitution and calculate the likelihood that the mutation is caused by an evolutionary change (i.e., over time) and not by sheer chance. Further, these substitution matrices assume that substitutions that are consistent with evolutionary trends conserved across many species are less likely to disrupt protein function. Conversely, substitutions that are not consistent with evolution (i.e., non-conserved substitutions) are more likely associated with disease.

Alternative approaches utilizing amino acid properties have considered how physicochemical properties differ, with changes in volume, hydrophobicity, net charge, packing density, and solvent accessibility all shown to correlate with the predicted functional impact of SNP variants. For example, the Grantham distance method combines the biophysical properties and evolutionary distances between amino acid pairs, in a setting where the significance of the amino acid substitution is quantified in a three-dimensional (3D) space. Specifically, the significance of the amino acid substitution is quantified, as a weighted Euclidean distance, in the three-dimensional space having amino acid side chain composition, polarity and volume as coordinates. The weighted Euclidean distance is modeled to estimate amino acid substitution mutation rates.
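By way of non-limiting illustration, the weighted Euclidean distance described above might be computed as in the following Python sketch. The property triples (composition, polarity, volume), the weights, and the scale factor are the values commonly quoted for Grantham's formulation and are included here for a handful of residues as illustrative assumptions only; they should be checked against the published table before any real use. The function and constant names are hypothetical.

```python
import math

# (composition, polarity, volume) per residue; illustrative values only.
GRANTHAM_PROPERTIES = {
    "C": (2.75, 5.5, 55.0),
    "R": (0.65, 10.5, 124.0),
    "L": (0.0, 4.9, 111.0),
    "I": (0.0, 5.2, 111.0),
    "W": (0.13, 5.4, 170.0),
}
# Assumed per-coordinate weights and overall scale factor.
WEIGHTS = (1.833, 0.1018, 0.000399)
SCALE = 50.723


def grantham_distance(a, b, properties=GRANTHAM_PROPERTIES,
                      weights=WEIGHTS, scale=SCALE):
    """Weighted Euclidean distance between two residues in the
    composition/polarity/volume space."""
    squared = sum(w * (x - y) ** 2
                  for w, x, y in zip(weights, properties[a], properties[b]))
    return scale * math.sqrt(squared)


# A conservative substitution (Leu to Ile) scores far lower than a
# radical one (Cys to Arg or Cys to Trp).
for wild_type, mutant in (("L", "I"), ("C", "R"), ("C", "W")):
    print(wild_type, mutant, round(grantham_distance(wild_type, mutant)))
```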

Further, some computational algorithms focus on the fact that the importance of the evolutionary distance separating a pair of amino acids depends on the position where an amino acid substitution occurs. Specifically, positions at which the amino acid distribution is conserved across a protein family tend to be functionally or structurally important, and these positions may not tolerate a variety of amino acid changes. These equivalent positions may be found by constructing an alignment from multiple related protein sequences. Thus, amino acid residues in highly conserved alignment positions may be assumed to be under some purifying evolutionary selection and important for normal protein function. Computational algorithms may be used to quantify this conserved evolutionary selection, such as by calculating the frequency of the most common amino acid in an alignment column. For example, Shannon entropy may be used to summarize the distribution of all amino acids at a specific aligned position. This idea may further be improved by using relative entropy, which compares the amino acid distribution of a conserved alignment position against the amino acid background distribution.
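By way of non-limiting illustration, the Shannon entropy of an alignment column and its relative entropy against a background distribution might be computed as in the following Python sketch. The uniform background over the twenty standard amino acids and the example columns are assumptions chosen for simplicity; a practical tool would more likely use empirically observed background frequencies.

```python
import math
from collections import Counter

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background; real backgrounds are usually empirical.
UNIFORM_BACKGROUND = {aa: 1.0 / 20.0 for aa in STANDARD_AMINO_ACIDS}


def shannon_entropy(column):
    """Entropy in bits of the residue distribution in one alignment column."""
    if not column:
        raise ValueError("alignment column is empty")
    total = len(column)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(column).values())


def relative_entropy(column, background):
    """Kullback-Leibler divergence in bits of the column distribution
    from the background distribution."""
    if not column:
        raise ValueError("alignment column is empty")
    total = len(column)
    result = 0.0
    for residue, count in Counter(column).items():
        observed = count / total
        result += observed * math.log2(observed / background[residue])
    return result


conserved = "CCCCCCCC"   # hypothetical fully conserved column
variable = "ACDEFGHI"    # hypothetical highly variable column
print(shannon_entropy(conserved), shannon_entropy(variable))
print(relative_entropy(conserved, UNIFORM_BACKGROUND))
```

A fully conserved column has zero entropy and the largest divergence from the background, which is the sense in which low entropy signals a position that may not tolerate substitution.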

Some mutation prediction computational algorithms and prediction scoring tools for interpreting gene tests consider both physicochemical properties of amino acid substitution and evolutionary conservation. Examples of such methods include Sorting Intolerant From Tolerant (SIFT), Position-Specific Independent Counts (PSIC), Align Grantham Variation Grantham Distance (AGVGD), and Multivariate Analysis of Protein Polymorphism (MAPP) score.

The SIFT algorithm may be used to compute a weighted frequency average of which amino acid residue appears in a multiple alignment position, coupled with an estimate of unobserved variant frequencies.

The PSIC profile score method considers the difference of likelihood between reference and variant amino acid at a given aligned position using a position-specific scoring matrix (PSSM).

The AGVGD method is an extension of the original Grantham distance method that may be used in multiple sequence alignments and true simultaneous multiple comparisons. Grantham variation (GV) may be computed by replacing each value-pair of a given amino acid residue component for composition, polarity, and volume with the maximum and minimum value in that alignment position.

The MAPP score constructs a statistical summary of an alignment column by use of phylogenetic tree and tree topology weighting each sequence by branch length.

Furthermore, some computational algorithms consider protein structure-function relationships of amino acid substitution. For example, solvent accessibility of an amino acid may be used as a predictor of functional impact, where substituting various amino acid residues may disrupt the hydrophobic core of a soluble protein. Structural modeling of disease proteins may be used to determine whether a nsSNP variant results in protein backbone strain or leads to overpacking substitutions. A large number of X-ray crystal structures have been determined which often include protein interacting partners, and/or small molecule, peptide ligands or inhibitors. The ability to locate a nsSNP variant on a computational protein structure makes it possible to evaluate whether the amino acid substitution occurs in or near a binding or catalytic site or at a domain-domain interface of protein interaction.

Polymorphism Phenotyping (PolyPhen) is an example of an algorithm that takes advantage of structural modeling. Specifically, PolyPhen is an automated tool that may be used to evaluate any possible impact of amino acid substitution on the structure and function of a human protein. PolyPhen uses a Dictionary of Secondary Structure in Proteins (DSSP) to map a given substitution site to known protein 3D structures.

Mutation Prediction (MutPred) is another example of a prediction algorithm that may be used with the embodiments described herein. MutPred generates mutability profiles of amino acid sequences from the corresponding complementary DNA sequences and generates weighted and un-weighted profiles. In the weighted profiles relative mutabilities are multiplied by the likelihood of clinical detection depending on chemical differences.

PMUT is another example of a mutation prediction algorithm. PMUT uses a two-layer neural network and is trained using human mutational data. PMUT allows for either prediction of single point amino acid mutations or scanning of mutational hot spots. Results are obtained by alanine scanning, identifying massive mutations, and genetically accessible mutations.

Although clinicians often rely on patient history, family segregation, literature review and trusted colleagues to stay informed of the phenotypic consequences of a given gene variant found in a gene test, in the absence of traditional evidence, well established machine learning or computational tools may be used to predict and assess phenotypic consequences of the gene variant. However, established algorithms do not always complete the prediction, and furthermore are not always in agreement with the curated data or each other.

RET (Rearranged During Transformation) proto-oncogene data are an example of the gene-disease data 160 that may be used with embodiments of the present invention. In some embodiments, well-curated gene variant collections, such as RET data, may be used. Further, in some embodiments, physicochemical properties of amino acids in the coded proteins may be utilized to determine mutation severity.

The RET oncogene is located on chromosome 10q11, with 21 exons coding a full length protein of 1,114 amino acids. Conserved functional domains found within the protein include a signal peptide, cadherin repeat domains, transmembrane domain, and protein tyrosine kinase. Mutations in the RET oncogene have been directly associated with Multiple Endocrine Neoplasia type 2 (MEN2), a hereditary thyroid carcinoma syndrome. Although well known mutations often guide patient therapy and surgical options, other RET sequence mutations vary in functional severity. Some mutations may be pathogenic, some may be benign, and some may be of unknown significance. Curated RET oncogene mutations for MEN2 have been reported, many of which have documented phenotype outcomes.

FIG. 2 illustrates RET protein domains and their reported disease causing variants as associated with different MEN2 phenotypes. Specifically, conserved domains of signal peptide (SP), cadherin repeat domains (CAD), cysteine rich region (CYS), transmembrane domain (TM), and protein tyrosine kinase (Kinase) are shown. Three specific disease phenotypes have been shown to be associated with these domains. Specifically, familial medullary thyroid cancer (FMTC), multiple endocrine neoplasia type 2A (MEN2A), and multiple endocrine neoplasia type 2B (MEN2B) have been shown to correspond to CYS and Kinase domains.

The RET gene belongs to the cadherin superfamily and encodes a receptor tyrosine kinase which functions in signaling pathways for cell growth and differentiation. The RET gene plays a critical role in neural crest development and may undergo oncogenic activation, in vivo and in vitro, by cytogenetic rearrangement. The RET gene may further be classified by the Gene Ontology (GO) biological process categories of homophilic cell adhesion, posterior midgut development, and protein amino acid phosphorylation. The GO-annotated cellular component of RET is integral to membrane, and the GO molecular function category lists ATP binding, calcium ion binding, and transmembrane receptor protein tyrosine kinase activity.

As explained above, to date, various computational algorithms and prediction scoring tools for classifying gene test results (e.g., SIFT, PSIC, AGVGD, MutPred, or MAPP) have been developed. Traditional classification schemes may also be used to classify and interpret gene test results. For example, classifiers such as Zero Rules (ZeroR), naive Bayesian, Simple Logistic Regression (Simple Logistic), Support Vector Machine (SMO), k-nearest neighbor (IBk), and Random Forest Regression (Random Forest) may be used to interpret gene test results.

In one embodiment of the present invention, curated RET gene-disease data are used to train, test, and verify performance of various mutation classification and prediction tools (e.g., SIFT, PSIC, AGVGD, MutPred, or MAPP) as well as various traditional classification tools (e.g., ZeroR, Simple Logistic, SMO, IBk, or Random Forest). In one embodiment, k-fold cross validation may be used to assess classifier performance. In k-fold cross validation, the original dataset is partitioned into k subsamples; k−1 of the subsamples are used as training data for training the classifier, and the remaining subsample is used as validation data. The cross validation is repeated k times, such that each of the k subsamples is used exactly once as the validation data. The resulting k outcomes are averaged to produce a single estimation. In one embodiment, the weighted average from a three-fold cross validation (i.e., k=3) of sensitivity (true positive rate), specificity (true negative rate), and positive predictive value (precision) may be calculated for each classifier algorithm. Specifically, assuming that the probability of having a true detection (hit) is p, the probability of having a false detection (miss) is q=1−p. The probability of having a true positive may be calculated as p² and the probability of having a false negative may be calculated as pq. Likewise, the probability of having a true negative may be calculated as q² and the probability of having a false positive may be calculated as pq. Using this definition, the sensitivity of the classifier is p²/(p²+pq)=p and the specificity of the classifier is q²/(q²+pq)=q. The performance of the classifier (i.e., positive predictive value) may be measured as a function of the sensitivity and specificity.
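By way of non-limiting illustration, the k-fold partitioning and averaging described above might be sketched in Python as follows. The function names, the random seed, and the `score_fold` callback (which stands in for training a classifier on the training indices and scoring it on the held-out indices) are hypothetical. A plain, unweighted average is used here for simplicity, which is an assumption of the sketch.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) so that every item is held out
    exactly once across the k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out
                 for idx in fold]
        yield train, test


def cross_validate(score_fold, n_items, k=3, seed=0):
    """Average the per-fold scores returned by score_fold(train, test)."""
    scores = [score_fold(train, test)
              for train, test in k_fold_splits(n_items, k, seed)]
    return sum(scores) / len(scores)


# Three-fold example on ten hypothetical variants.
for train, test in k_fold_splits(10, 3):
    print(sorted(train), sorted(test))
```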

Primary Sequence Amino Acid Properties (PSAAP) classifier is another example of a classification and prediction tool that may be used to interpret gene test results. The details of the PSAAP prediction algorithm are described in Attorney Docket No. 076950-0130, U.S. patent application Ser. No. 13/471,294, filed on May 14, 2012, the teaching of which is hereby incorporated by reference in its entirety for all purposes.

FIG. 3 is a flow diagram of the procedures for interpreting gene test results using a PSAAP classifier. The RET variant data may be used to test and train the PSAAP algorithm 305. In some embodiments, non-synonymous RET variant data may be used 305. Non-synonymous RET variants are characterized by physicochemical differences in primary amino acid sequence resulting from the mutation. Regardless of the type of RET variant data used, the RET variant data may include exonic nsSNP variants 310 with known outcomes of benign and pathogenic 320. Attribute selection (feature selection) 340 may be performed to select a subset of relevant features that may be used for classification. Specifically, attributes of mutation status may be characterized using values of physical, chemical, conformational, or energetic properties of the genes. Attribute selection (feature selection) 340 may be performed during classification training/testing.

The properties used in attribute selection 340 may be obtained from an AAindex database 330. AAindex 330 is a database of numerical indices that represent various physicochemical and biochemical properties of amino acids and pairs of amino acids. For each RET variant, matrices of delta values 335 for each biochemical property of the substituted amino acid are calculated using the corresponding AAindex 330. The resulting mutations are described by an array of variables, archived using a structured query language (SQL), that corresponds to the absolute value of the difference between the value of the property in the amino acid present in the wild type and the one in the mutant.
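By way of non-limiting illustration, the delta values described above might be computed as in the following Python sketch. The two small property tables are stand-ins for AAindex entries and hold commonly quoted hydropathy and side chain volume values for a few residues; they are illustrative assumptions, as are the function names and the one-letter variant notation parsed here.

```python
import re

# Stand-ins for AAindex entries; illustrative values for a few residues only.
PROPERTY_TABLES = {
    "hydropathy": {"C": 2.5, "R": -4.5, "W": -0.9, "Y": -1.3,
                   "G": -0.4, "S": -0.8, "F": 2.8},
    "volume": {"C": 55.0, "R": 124.0, "W": 170.0, "Y": 136.0,
               "G": 3.0, "S": 32.0, "F": 132.0},
}


def parse_variant(text):
    """Split a variant such as 'C634R' into (wild type, position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", text)
    if match is None:
        raise ValueError("expected a variant like C634R, got %r" % text)
    return match.group(1), int(match.group(2)), match.group(3)


def delta_features(wild_type, mutant, property_tables=PROPERTY_TABLES):
    """Absolute wild type versus mutant difference for every property."""
    return {name: abs(table[wild_type] - table[mutant])
            for name, table in property_tables.items()}


wild_type, position, mutant = parse_variant("C634R")
print(position, delta_features(wild_type, mutant))
```

Each variant thus becomes a fixed-length numeric vector, one entry per property, which is the form a classifier such as a naive Bayesian classifier can consume.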

Random selection may be used to build a training set 350 and a test set 360. Although training and test sets include different disease subtypes such as MEN2A, MEN2B, FMTC, MEN2A and FMTC, class labels of “pathogenic” and “benign” 320 may be used to describe all curated disease association.

A classifier, such as a naive Bayesian classifier 355, may be employed to classify the variants. Specifically, the training set 350 may be used to train the classifier 355. The test set 360 is then tested using the classifier 355 and the outcome of the test is used to assign disease association 365 to the gene variants in the test set 360. Uncertain variants 370 are also analyzed and their predicted disease association 375 is output from the PSAAP classifier.
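By way of non-limiting illustration, random selection of training and test sets followed by naive Bayesian classification might be sketched in Python as follows. The Gaussian likelihood model, the variance floor, the split fraction, the function names, and the toy feature vectors are all assumptions of this sketch; the specification does not state which form of naive Bayesian classifier is used.

```python
import math
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Randomly split (features, label) records into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def train_gaussian_nb(records, min_variance=1e-6):
    """Estimate a prior plus per-feature mean and variance for each label."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [max(sum((x - m) ** 2 for x in col) / n, min_variance)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(records), means, variances)
    return model


def predict(model, features):
    """Return the label with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy delta-value vectors with curated labels (hypothetical data).
curated = [([0.10, 0.20], "benign"), ([0.20, 0.10], "benign"),
           ([0.15, 0.25], "benign"), ([5.0, 6.0], "pathogenic"),
           ([5.5, 6.2], "pathogenic"), ([6.0, 5.8], "pathogenic")]
model = train_gaussian_nb(curated)
# An uncertain variant is then assigned a predicted disease association.
print(predict(model, [5.8, 6.0]))
```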

The performance of the PSAAP algorithm may be evaluated using calculated values of sensitivity (true positive rate), specificity (true negative rate), and positive predictive value (precision).
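By way of non-limiting illustration, sensitivity, specificity, and positive predictive value might be calculated from curated and predicted labels as in the following Python sketch. The function names and the example labels are hypothetical, and treating “pathogenic” as the positive class is an assumption of the sketch.

```python
def confusion_counts(actual, predicted, positive="pathogenic"):
    """Count true/false positives and negatives for the positive class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Return the ratio, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def performance(actual, predicted, positive="pathogenic"):
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # precision
    }


# Hypothetical curated labels and classifier predictions.
actual = ["pathogenic", "pathogenic", "pathogenic", "benign", "benign"]
predicted = ["pathogenic", "pathogenic", "benign", "benign", "pathogenic"]
print(performance(actual, predicted))
```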

FIG. 4 is a table that summarizes performance of various classifiers (e.g., mutation prediction tools and traditional classifiers) in interpreting gene test results using a dataset of RET gene variant-disease data. The classifier performance is ranked by positive predictive value (PPV), i.e., the percentage of variants classified as pathogenic that actually were pathogenic.

For the dataset used, the ZeroR (zero rules) classifier, which selects the majority class by default, yields a baseline performance of 55.7%. The nearest neighbor, random forest, support vector machine, and simple logistic classifiers perform similarly to one another, at 77.6%, 78.9%, 79.1%, and 81.4%, respectively. The naïve Bayesian classifier appears to be the best performing of these traditional classifiers, with a positive predictive value of 82.7%, which translates to a gain of 27 percentage points over the ZeroR classifier. Further, as shown, the traditional classifiers (e.g., ZeroR, IBk, Random Forest, SMO, Simple Logistic, Naïve Bayesian) perform better than or similar to the existing mutation prediction algorithms (e.g., PolyPhen, SIFT, MutPred, and PMUT). Specifically, PolyPhen, SIFT, MutPred, and PMUT result in positive predictive values of 54.1%, 77.9%, 84.3%, and 72.3%, respectively. The PSAAP prediction algorithm yields the highest performance at 8.3%.

FIG. 5 is a high-level block diagram of procedures for determining a consensus score for multiple classifiers according to certain embodiments disclosed herein. The consensus classifier 110, shown in FIG. 1, may employ the procedures shown in FIG. 5 to classify and interpret test results.

Various classifiers 510-1, . . . , 510-n, such as those outlined in FIG. 4, may be used. In one embodiment, a user may select a finite number of classifiers from among various available classifiers. For example, in one embodiment, the MutPred, PMUT, PolyPhen, SIFT, and PSAAP classifiers may be used. Various combinations of classifiers may be used with embodiments of the present invention. The classifiers outlined in FIG. 4 are intended to serve as non-limiting examples of classifiers that may be used with the embodiments disclosed herein. One skilled in the art appreciates that any classifier and prediction tool known in the art may be used with the embodiments disclosed herein.

As explained above, each classifier 510-i may be trained and tested with gene variant-disease data 160 (FIG. 1), such as RET gene-disease data. Complementary methods (not shown, e.g., sequence alignment, structural alignment, amino acid substitution penalties, structural disruption, sequence homology, etc.) may be applied to the data prior to performing classification and prediction.

Gene test data 501 from genetic testing may be input to the classifiers 510-1, . . . , 510-n. A descriptive statistics calculator 530 may calculate descriptive statistics values 530-i for each classifier outcome (hereinafter referenced as 510-i) using the numerical output obtained from each classifier 510-i. The descriptive statistics values 530-i for each classifier may include elements such as mean, median, standard deviation, minimum, and maximum of the numerical values output from that classifier. The descriptive statistics 530-i of each classifier 510-i may be used to summarize and quantitatively represent the features of the data collected from each classifier 510-i.
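The descriptive statistics listed above can be computed with the Python standard library. The sketch below is illustrative: the score list is hypothetical, the function name `describe` is not from the description, and the sample (n-1) standard deviation is an assumption, since the description does not say which form is used.

```python
import statistics


def describe(scores):
    """Summarize the numerical outputs of one classifier."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation (assumed)
        "min": min(scores),
        "max": max(scores),
    }


classifier_scores = [10, 20, 30, 40]  # hypothetical outputs of one classifier
summary = describe(classifier_scores)
print(summary)
```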

A correlation calculator 550 may calculate the correlation between each pair of classifiers. For example, if five classifiers (C1, C2, C3, C4, C5) are being used, the correlation between classifiers (C1, C2), (C1, C3), (C1, C4), (C1, C5), (C2, C3), (C2, C4), (C2, C5), (C3, C4), (C3, C5), and (C4, C5) may be obtained. In some embodiments, the obtained correlation coefficients 555-i may be arranged into a correlation matrix. In certain embodiments, the correlation coefficients 555-i may be obtained using the descriptive statistics 530-i obtained from each classifier 510-i. In some embodiments, a Spearman's rank correlation coefficient may be calculated between the classifier pairs and used as a numerical or non-parametric measure of the statistical dependence between the classifier pairs. One skilled in the art appreciates that other available methods for determining correlation dependence may be used in addition to or in place of Spearman's correlation to determine the correlation coefficient 555-i.
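The pairwise correlation step may be sketched as follows, computing Spearman's rank correlation as the Pearson correlation of average ranks. This is a plain-Python illustration rather than the described implementation; the classifier names and outputs are hypothetical, and the sketch assumes no classifier produces constant output (which would make the correlation undefined).

```python
from itertools import combinations


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def correlation_matrix(outputs):
    """Build a symmetric matrix of Spearman coefficients for every pair."""
    names = list(outputs)
    matrix = {a: {b: 1.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = spearman(outputs[a], outputs[b])
    return matrix


outputs = {
    "C1": [0.1, 0.4, 0.8, 0.9],
    "C2": [0.2, 0.3, 0.7, 0.95],
    "C3": [0.9, 0.6, 0.5, 0.1],
}
matrix = correlation_matrix(outputs)
print(matrix["C1"]["C2"], matrix["C1"]["C3"])
```

With five classifiers, `combinations` yields the ten pairs enumerated above.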

Further, a variance analyzer 560 may be used to determine the variance between the independent classifiers 510-1, . . . , 510-n using the descriptive statistics 530-i obtained from each classifier 510-i. For example, Factor Analysis may be used to describe the variance among the independent classifiers 510-1, . . . , 510-n. Factor analysis describes the variability among the classifiers in terms of a number of variables or “factors.”

Factor analysis may be performed using principal components to determine the weights of association between the different classifiers 510-i. Specifically, in one embodiment, a set of eigenvectors is applied to weight each classifier 510-i accordingly by eigenvalues from principal components, with more than 80% of the cumulative variance reached using only the first three eigenvalues.
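The cumulative-variance criterion can be illustrated with a short sketch that counts how many of the largest eigenvalues are needed to reach a threshold. The eigenvalues below are hypothetical (chosen to sum to five, as would the eigenvalues of a five-classifier correlation matrix), and the function name `components_needed` is illustrative.

```python
def components_needed(eigenvalues, threshold=0.80):
    """Return the smallest number of leading eigenvalues whose share of
    the total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)


eigenvalues = [2.6, 1.0, 0.7, 0.4, 0.3]  # hypothetical values
k = components_needed(eigenvalues)
print(k)
```

For these placeholder values the cumulative shares are 52%, 72%, and 86%, so three components are retained.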

Further, to compensate for the lack of independence between the variables (i.e., descriptive statistics values 530-i), a weighted average calculator 570 may use the resulting correlation rank 555-i and variance 565-i values for each classifier to determine a weighted average 575-i for each classifier. A consensus score calculator 580 may determine a consensus score 585-i for each classifier based on weighted average score of each classifier 510-i.

For example, in one embodiment, the classifiers 510-1, . . . , 510-n are used to determine a numerical prediction for each gene variant. The numerical prediction of each classifier is then used to obtain descriptive statistics 530-i for each classifier 510-i. Spearman correlation coefficients, describing the correlation of each classifier with the other classifiers, are then calculated. The Spearman correlation coefficients may be arranged into a matrix whose eigenvalues are obtained during principal component analysis. Once principal component analysis is performed, the obtained matrix is analyzed to determine whether the classifiers are independent of one another. If the classifiers are determined to be independent, a consensus score for each classifier may be obtained as a linear sum of the numerical outcomes obtained from that classifier. If the classifiers are determined to be dependent, a consensus score for each classifier may be obtained as a function of a reduced matrix obtained from the principal component analysis of the correlation matrix. Specifically, a scalar product (inner product or dot product) of the reduced matrix and the numerical outcomes of a classifier may be used to obtain the consensus score for that classifier.

A minimum number of eigenvectors that cumulatively explain the variance of the independent classifiers are selected with respect to a predetermined threshold. The reduced set of eigenvectors is used to create a weighted average sum for each classifier. Specifically, the weighted average sum, in some embodiments, may be calculated by multiplying (via an inner or dot product) the reduced eigenvector by the descriptive statistics of each classifier. This results in a single number that may be used as the “consensus score” of a classifier.
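The two scoring paths described above, a linear sum for independent classifiers and a dot product with reduced weights otherwise, might be sketched as below. How the reduced eigenvectors are collapsed into a single weight vector is not specified in the description, so the `weights` argument and its values are assumptions, as are the numerical outcomes.

```python
def dot(weights, values):
    """Inner (dot) product of two equal-length sequences."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * v for w, v in zip(weights, values))


def consensus_score(values, weights=None):
    """Linear sum when no weights are given (independent case);
    otherwise the dot product with the reduced weights (dependent case)."""
    if weights is None:
        return sum(values)
    return dot(weights, values)


values = [80, 60, 90]               # hypothetical numerical outcomes
reduced_weights = [0.5, 0.25, 0.25]  # hypothetical reduced-eigenvector weights

independent_score = consensus_score(values)
dependent_score = consensus_score(values, reduced_weights)
print(independent_score, dependent_score)
```

Either path yields a single number, which is the property the consensus score relies on.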

In some embodiments, a reference range for the consensus score may be determined. Specifically, a reference range may be defined such that if the consensus score for a gene test falls within that range, the gene variant is classified as pathogenic and if the consensus score for a gene test falls outside of that range it is classified as benign.

For example, the reference range may be calculated for RET gene variants with known disease outcome, by analogy with calculating analyte reference intervals for age or gender in traditional laboratory testing. A nonparametric reference interval may be used for the benign (n=46) and pathogenic (n=51) variants, with 95% confidence intervals (CI) for the lower and upper bounds. The confidence ratio of the reference interval may also be calculated.
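A nonparametric reference interval and its use in interpretation can be sketched as follows. The sketch assumes the interval is the central 95% of the observed scores, estimated by linear interpolation between order statistics, and it assumes a two-interval scheme in which a score falling in neither interval (or in both) is reported as uncertain. The score lists are hypothetical; only the example scores 367 and 97 correspond to the variants shown in FIGS. 7A-7B.

```python
def percentile(sorted_values, fraction):
    """Linearly interpolated percentile of an already sorted list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def reference_interval(scores, coverage=0.95):
    """Central nonparametric interval covering `coverage` of the scores."""
    ordered = sorted(scores)
    tail = (1 - coverage) / 2
    return percentile(ordered, tail), percentile(ordered, 1 - tail)


def interpret(score, benign_interval, pathogenic_interval):
    """Report the class whose interval contains the score, else uncertain."""
    in_benign = benign_interval[0] <= score <= benign_interval[1]
    in_pathogenic = pathogenic_interval[0] <= score <= pathogenic_interval[1]
    if in_pathogenic and not in_benign:
        return "pathogenic"
    if in_benign and not in_pathogenic:
        return "benign"
    return "uncertain"


benign_scores = [60, 70, 75, 83, 90, 97, 110]       # hypothetical
pathogenic_scores = [300, 340, 367, 400, 430, 470]  # hypothetical
benign_interval = reference_interval(benign_scores)
pathogenic_interval = reference_interval(pathogenic_scores)

print(interpret(367, benign_interval, pathogenic_interval))
print(interpret(97, benign_interval, pathogenic_interval))
print(interpret(200, benign_interval, pathogenic_interval))
```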

The overall consensus score may be used to augment interpretation in the events in which a gene-specific classifier does not outperform the existing tools. This advantage of a consensus predictor over a single predictor may be seen by removing seven RET gene variants with known disease association that were originally classified as variants of uncertain significance. After excluding these seven variants from the gene-specific training set, analysis using the consensus score calculator is repeated. Under such a setup, the consensus score correctly predicts six of the seven variants. Closer inspection showed that the remaining seventh variant is a nucleotide-level “silent” polymorphism (no amino acid change), which could have been recognized by splice effect prediction software.

In some embodiments, a graphing display may be used in the consensus score calculator 580 to preserve contribution of each variable (classifier). For example, radial plots (also known as radar or spider plots) may be employed.

FIGS. 6A-6B illustrate the use of radar plots for consensus scoring. As shown in FIGS. 6A-6B, using radar plots for consensus scoring may preserve the contribution of each predictor to the total sum. For example, as shown in FIG. 6A, a consensus score plot of 470 (85, 90, 98, 97, 100) is obtained for the pathogenic gene variant C609Y. As shown in FIG. 6B, a consensus output of 83 (7, 13, 19, 4, 40) is obtained for the benign variant V376A.
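The way a radar plot keeps each predictor's contribution visible while still reporting the total can be sketched numerically. The contributions below are the values quoted for FIGS. 6A-6B; the axis layout (equal angular spacing starting at zero radians) and the function name `radar_vertices` are assumptions made for illustration.

```python
import math


def radar_vertices(contributions):
    """Place each contribution on its own axis, equally spaced around a circle,
    and return the (x, y) polygon vertices of the radar plot."""
    n = len(contributions)
    return [
        (value * math.cos(2 * math.pi * i / n), value * math.sin(2 * math.pi * i / n))
        for i, value in enumerate(contributions)
    ]


pathogenic_c609y = [85, 90, 98, 97, 100]  # contributions quoted for FIG. 6A
benign_v376a = [7, 13, 19, 4, 40]         # contributions quoted for FIG. 6B

total_pathogenic = sum(pathogenic_c609y)
total_benign = sum(benign_v376a)
vertices = radar_vertices(pathogenic_c609y)
print(total_pathogenic, total_benign)
```

The distance of each vertex from the origin equals the corresponding predictor's contribution, so the individual values remain readable even though the reported score is their sum.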

Further, a more comprehensive display for consensus scoring may be used for augmenting clinical decision making. FIGS. 7A-7B include examples of such a display. As shown, the display may incorporate features such as classifier output, predictor calls, and weighted sum, and may be presented in a color-metric scale. In FIG. 7A, a pathogenic gene variant C634R having a consensus score of 367 is shown, and in FIG. 7B, a benign variant G691S with a consensus score of 97 is shown.

Those of skill in the art would appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. Various components and blocks may be arranged differently (e.g., arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.

It is understood that the specific order or hierarchy of steps in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged. Some of the steps may be performed simultaneously. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. The previous description provides various examples of the subject technology, and the subject technology is not limited to these examples. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.

A phrase such as an “aspect” does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. An aspect may provide one or more examples. A phrase such as an aspect may refer to one or more aspects and vice versa. A phrase such as an “embodiment” does not imply that such embodiment is essential to the subject technology or that such embodiment applies to all configurations of the subject technology. A disclosure relating to an embodiment may apply to all embodiments, or one or more embodiments. An embodiment may provide one or more examples. A phrase such as an embodiment may refer to one or more embodiments and vice versa. A phrase such as a “configuration” does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A configuration may provide one or more examples. A phrase such as a configuration may refer to one or more configurations and vice versa.

All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. §112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” Furthermore, to the extent that the term “include,” “have,” or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.

Various modifications may be made to the examples described in the foregoing, and any related teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings. 

1. A method of interpreting biologic information, comprising: selecting a plurality of classifiers, each trained using data relating to a gene variant and to a respective phenotype associated with the variant, for interpreting a genetic test; obtaining an interpretation of the genetic test from each selected classifier; determining a correlation matrix including a numerical correlation of each of a plurality of pairs of classifiers, the numerical correlation indicating a correlation between members of the respective pair; by a processor, performing factor analysis of the correlation matrix to determine whether the interpretation obtained from a classifier is statistically independent from interpretations of remaining classifiers, and, in an event the classifier is not statistically independent from the interpretations of the remaining classifiers, obtaining a consensus score for each classifier as a function of a reduced matrix obtained from the factor analysis of the correlation matrix; and comparing an overall consensus score, calculated using the consensus scores determined for each classifier, against a predetermined range including a threshold; and when the overall consensus score falls within the range, reporting a genetic test interpretation associated with the threshold as an outcome of the genetic test.
 2. The method of claim 1, further comprising obtaining the consensus score for each classifier as a function of a linear sum of the interpretations obtained from the classifier in an event the interpretations of all classifiers are statistically independent from one another.
 3. The method of claim 1, further comprising obtaining the consensus score for each classifier as a function of a scalar product of the reduced matrix and the interpretation obtained from the classifier.
 4. The method of claim 1, wherein performing factor analysis of the correlation matrix includes performing at least one of regression modeling, common factor analysis, or principal component analysis of the correlation matrix.
 5. The method of claim 1, further comprising: accessing a database including the data relating to a gene variant and its phenotype, and training each classifier using the data obtained from the database.
 6. The method of claim 1, wherein the data relating to the gene variant and its phenotypes include Rearranged During Transformation (RET) proto-oncogene data.
 7. The method of claim 1, further comprising determining the predetermined range by determining a first numerical consensus score reference interval for a benign phenotype associated with a gene variant and a second numerical consensus score reference interval for a pathogenic phenotype associated with a gene variant.
 8. The method of claim 7, further comprising comparing the overall consensus score against the first and second numerical consensus score reference intervals, and reporting a benign gene test outcome in an event the consensus score falls within the first numerical interval or reporting a pathogenic gene test outcome in an event the consensus score falls within the second numerical interval.
 9. The method of claim 1, further comprising obtaining a numerical representation of the interpretation obtained from each classifier, the numerical representation including at least one of mean, median, standard deviation, minimum, or maximum of numerical values output from the classifier.
 10. The method of claim 9, further comprising determining the correlation matrix as a function of numerical representations of the plurality of the classifiers.
 11. The method of claim 1, wherein determining the numerical correlation between each pair of classifiers includes determining Spearman's rank correlation coefficient for the pair of classifiers.
 12. The method of claim 1, wherein reporting the genetic test interpretation includes graphically displaying the overall consensus score on the predetermined range.
 13. The method of claim 1, wherein reporting the genetic test interpretation includes displaying the consensus scores of the plurality of classifiers on a radial plot.
 14. The method of claim 1, further comprising reporting an uncertain outcome for the genetic test in an event the overall consensus score falls outside of the predetermined range.
 15. The method of claim 14, further comprising graphically reporting the uncertain outcome, the graphical report including display of the overall consensus score superimposed on the predetermined range.
 16. The method of claim 14, further comprising graphically reporting the uncertain outcome, the graphical report including display of the consensus scores of the plurality of classifiers on a radial plot.
 17. A non-transitory computer-readable medium encoded with a computer program comprising instructions executable by a processor for: selecting a plurality of classifiers, each trained using data relating to a gene variant and to a respective phenotype associated with the variant, for interpreting the genetic test; obtaining an interpretation of the genetic test from each selected classifier; determining a correlation matrix including a numerical correlation of each of a plurality of pairs of classifiers, the numerical correlation indicating a correlation between members of the respective pair; performing factor analysis of the correlation matrix to determine whether the interpretation obtained from a classifier is statistically independent from interpretations of remaining classifiers, and, in an event the classifier is not statistically independent from the interpretations of the remaining classifiers, obtaining a consensus score for each classifier as a function of a reduced matrix obtained from the factor analysis of the correlation matrix; and comparing an overall consensus score, calculated using the consensus scores determined for each classifier, against a predetermined range including a threshold; and when the overall consensus score falls within the range, reporting a genetic test interpretation associated with the threshold as an outcome of the genetic test.
 18. The non-transitory computer-readable medium of claim 17, further comprising obtaining the consensus score for each classifier as a function of a linear sum of the interpretations obtained from the classifier in an event the interpretations of all classifiers are statistically independent from one another.
 19. The non-transitory computer-readable medium of claim 17, further comprising obtaining the consensus score for each classifier as a function of a scalar product of the reduced matrix and the interpretation obtained from the classifier.
 20. The non-transitory computer-readable medium of claim 17, wherein performing factor analysis of the correlation matrix includes performing at least one of regression modeling, common factor analysis, or principal component analysis of the correlation matrix.